The data set consists of:
iTRAQ proteome profiling of 77 breast cancer samples + 3 healthy samples, with expression values for ~12,000 proteins for each sample.
A file containing the clinical data of the 77 breast cancer patients (TCGA ID, sex, age, tumor receptors, etc.).
A file containing the list of genes and proteins used by the PAM50 classification system.
Analysis of this data set is relevant to several potential applications in breast cancer: expression analysis to identify biomarkers, understanding disease heterogeneity, and inferring personalized treatment strategies.
Description of relevant variables
Dropped variables (redundant or not relevant): Survival.Data.Form, Days.to.date.of.Death, Days.to.Date.of.Last.Contact, OS.Time, Vital.Status, Tumor..T1.Coded, Metastasis.Coded, AJCC.Stage, Converted.Stage, and all the columns intended for clustering the data.
Created variables:
Age.Ini.Diagnostic.group: intervals of 10 years starting from 30 and going up until 90.
Age.Menopausal.group: [30, 45) Pre-menopausal, [45, 55) Menopausal, [55, 90) Post-menopausal.
ER_PR_HER2: level from 0 to 7 depending on the hormone receptors (ER, PR) present and the level of HER2.
TNBC: 0 if positive, 1 if negative
AJCC.Simp: simplified AJCC stages (I, II, III, and IV)
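The two age groupings above can be sketched with base R's `cut()`. This is a minimal illustration on made-up ages, not the project's actual code; the vector `age` and the result names are assumptions.

```r
# Hypothetical ages at initial diagnosis (made-up values for illustration)
age <- c(34, 45, 54, 55, 62, 89)

# 10-year intervals from 30 to 90, closed on the left: [30,40), [40,50), ...
age_group <- cut(age,
                 breaks = seq(30, 90, by = 10),
                 right = FALSE)

# Menopausal groups: [30,45) pre, [45,55) menopausal, [55,90) post
menopausal_group <- cut(age,
                        breaks = c(30, 45, 55, 90),
                        labels = c("Pre-menopausal",
                                   "Menopausal",
                                   "Post-menopausal"),
                        right = FALSE)

data.frame(age, age_group, menopausal_group)
```

With `right = FALSE`, an age of exactly 90 would fall outside the last interval and become `NA`; adding `include.lowest = TRUE` would close that interval if such values occur.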
# Perform k-means clustering
kmeans_PAM50_5 <- PAM50_data |>
  kmeans(centers = 5,
         iter.max = 10)

# Define the hulls: the points that outline the shape of each cluster
hulls <- PAM50_cluster_data |>
  group_by(Cluster_5) |>
  group_modify(~ .x[chull(.x$PC1,
                          .x$PC2),
                    ])

# Plot the final comparison
PAM50_cluster_data |>
  ggplot(mapping = aes(PC1,
                       PC2,
                       fill = factor(Cluster_5))) +
  scale_fill_manual(values = Cluster_colors) +
  scale_color_manual(values = PAM50_colors) +
  geom_point(mapping = aes(color = PAM50.mRNA),
             size = 1) +
  geom_polygon(data = hulls,
               alpha = .1) +
  guides(fill = "none",
         color = guide_legend(title = "PAM50 classes")) +
  labs(title = "PAM50 classification compared to the predicted clusters\n(K-means clusters represented as a grey contour)",
       x = "Fitted PC1 (39.54 %)",
       y = "Fitted PC2 (14.55 %)") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5))

Two functions were created to make it easy to run several comparisons on the present data.
DEA_proteins()
DEA_proteins <- function(data_in, condition_test){
  # Capture the name of the condition column as a string
  col_name <- deparse(substitute(condition_test))

  # Keep the protein columns (RefSeq IDs starting with NP, XP or YP) plus the
  # condition, then reshape to one row per sample and protein
  data_long <- data_in |>
    dplyr::select(matches("^NP"),
                  matches("^XP"),
                  matches("^YP"),
                  {{ condition_test }}) |>
    pivot_longer(cols = -{{ condition_test }},
                 names_to = "Protein",
                 values_to = "log2_iTRAQ")

  # Nest the data so that each row holds the measurements of one protein
  data_long_nested <- data_long |>
    group_by(Protein) |>
    nest() |>
    ungroup()

  # Fit one linear model per protein: log2_iTRAQ ~ condition
  data_w_model <- data_long_nested |>
    group_by(Protein) |>
    mutate(model_object = map(.x = data,
                              .f = ~lm(formula = as.formula(str_c("log2_iTRAQ ~ ", col_name)),
                                       data = .x)))

  # Extract coefficients with 95 % confidence intervals
  data_w_model <- data_w_model |>
    mutate(model_object_tidy = map(.x = model_object,
                                   .f = ~tidy(.x,
                                              conf.int = TRUE,
                                              conf.level = 0.95)))

  # Keep the condition term only; the term equals col_name when the condition
  # is numeric (e.g. coded 0/1). p.adjust() uses the Holm method by default.
  estimates <- data_w_model |>
    unnest(model_object_tidy) |>
    filter(term == col_name) |>
    ungroup() |>
    dplyr::select(Protein, p.value, estimate, conf.low, conf.high) |>
    mutate(q.value = p.adjust(p.value)) |>
    mutate(dif_exp = case_when(q.value <= 0.05 & estimate > 0 ~ "Up",
                               q.value <= 0.05 & estimate < 0 ~ "Down",
                               q.value > 0.05 ~ "NS"))

  plt_volcano <- volcano_plot(estimates, col_name)
  return(list(estimates = estimates, plt_volcano = plt_volcano))
}

volcano_plot()
volcano_plot <- function(data, condition_test){
  plt <- data |>
    # Build the legend labels, adding the protein count for Up and Down
    group_by(dif_exp) |>
    mutate(label = case_when(dif_exp == "Up" ~ str_c(dif_exp,
                                                     " (Count: ",
                                                     n(),
                                                     ")"),
                             dif_exp == "Down" ~ str_c(dif_exp,
                                                       " (Count: ",
                                                       n(),
                                                       ")"),
                             dif_exp == "NS" ~ str_c(dif_exp))) |>
    ggplot(aes(x = estimate,
               y = -log10(p.value),
               colour = label)) +
    geom_point(alpha = 0.4,
               shape = "circle") +
    labs(title = str_c("Differentially expressed proteins in the test: ",
                       condition_test,
                       " vs. Non-",
                       condition_test),
         subtitle = "Proteins highlighted in either red or blue were\nsignificant after multiple test correction",
         x = "Estimates",
         y = expression(-log[10]~(p)),
         color = "Differential expression") +
    # Colours follow the alphabetical order of the labels (Down, NS, Up);
    # this assumes all three groups are present in the data
    scale_color_manual(values = c("blue",
                                  "grey",
                                  "red")) +
    theme_minimal() +
    theme(legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
  return(plt)
}

NP_002094.2: glycogen [starch] synthase, muscle isoform 1
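The core of DEA_proteins() is one linear model per protein followed by a multiple-testing correction. The sketch below reproduces that idea in base R on simulated data, so it runs without the tidyverse. The protein names, the `condition` vector, and the helper `fit_one()` are hypothetical, and the numbers are not results from this project. It also shows that `p.adjust()` uses the Holm method unless another method is requested, which may be worth keeping in mind when reading the `q.value` column.

```r
set.seed(1)  # simulated data only; not the project's data

n <- 20
condition <- rep(c(0, 1), each = n / 2)  # hypothetical binary condition (e.g. TNBC)

# Three simulated proteins: only the first has a true group difference
expr <- cbind(
  prot_A = rnorm(n, mean = 2 * condition),
  prot_B = rnorm(n),
  prot_C = rnorm(n)
)

# Fit one linear model per protein and keep the slope's estimate and p-value
fit_one <- function(y) {
  coefs <- summary(lm(y ~ condition))$coefficients
  c(estimate = coefs["condition", "Estimate"],
    p.value  = coefs["condition", "Pr(>|t|)"])
}
res <- as.data.frame(t(apply(expr, 2, fit_one)))

# p.adjust() defaults to Holm; "BH" gives Benjamini-Hochberg adjusted values
res$q.holm <- p.adjust(res$p.value)
res$q.BH   <- p.adjust(res$p.value, method = "BH")
res
```

Holm-adjusted values are never smaller than the Benjamini-Hochberg ones, so the choice of method can change how many proteins pass the 0.05 threshold.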
Triple Negative Breast Cancer (TNBC) individuals show a different protein expression profile compared to non-TNBC individuals.
The protein expression profiles of individuals with breast cancer can be clustered according to the PAM50 gene classification system of breast cancer subtypes.
At least 84 proteins are down-regulated and 115 are up-regulated in TNBC individuals compared to non-TNBC individuals.
There seems to be no differential expression between breast cancer tumors that have and have not spread to the lymph nodes.
A deeper study of the down- and up-regulated proteins from the TNBC vs. non-TNBC differential expression analysis could help define the functional and genetic differences between the two types of breast cancer samples.
K-means could be developed as an unsupervised method to support the medical diagnosis of Basal-like breast cancer tumors.
This project could be run again with a more accurate proteome data set that includes, if possible, all the PAM50 genes and as few missing values (NAs) as possible.
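One simple way to judge how well k-means clusters line up with the PAM50 classes, for instance before considering k-means as a diagnostic aid for Basal-like tumors, is a contingency table plus a per-cluster purity. The sketch below uses made-up labels purely for illustration; the vectors `pam50` and `cluster` are assumptions and do not reflect this project's results.

```r
# Made-up labels for illustration only (not results from this project)
pam50 <- c("Basal-like", "Basal-like", "Basal-like", "Luminal A", "Luminal A",
           "Luminal B", "Luminal B", "HER2-enriched", "HER2-enriched", "Basal-like")
cluster <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1)

# Contingency table: rows are k-means clusters, columns are PAM50 classes
tab <- table(cluster, pam50)
tab

# Purity of each cluster: share of its samples in its most common PAM50 class
purity <- apply(tab, 1, max) / rowSums(tab)
purity
```

A cluster with purity close to 1 is dominated by a single PAM50 class; in this toy example only cluster 1 (all Basal-like) reaches that.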